Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.9 - Filling in Null Values")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
   id  breed_id  nickname             birthday  age  color
0   1         1      King  2014-11-22 12:30:31    5  brown
1   2         3     Argus  2016-11-22 10:05:10   10   None
2   3         1    Chewie  2016-11-22 10:05:10   15   None
3   3         2     Maple  2018-11-22 10:05:10   17  white
4   4         2      None  2019-01-01 10:05:10   13   None

Filling in Null Values

Working with real world data, some information can be missing, but it can often be inferred from other columns or replaced with a sensible default value. These inferred or default values will then fill in those missing values. Here we will show you a few ways to do this.

Option 1 - Fill in All Missing Values with a Default Value

(
    pets
    .fillna('Unknown')
    .toPandas()
)
   id  breed_id  nickname             birthday  age    color
0   1         1      King  2014-11-22 12:30:31    5    brown
1   2         3     Argus  2016-11-22 10:05:10   10  Unknown
2   3         1    Chewie  2016-11-22 10:05:10   15  Unknown
3   3         2     Maple  2018-11-22 10:05:10   17    white
4   4         2   Unknown  2019-01-01 10:05:10   13  Unknown

What Happened?

Using fillna we attempt to fill in all Null values with the value 'Unknown'. This is fine when the Null values live in string columns, but a single string default won't help if, say, a numeric age column is missing values. We look at how to specify different values for different columns next.
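
For example, here is a minimal sketch of handling a numeric column separately (the cast is an assumption, since we read the csv without a schema and age is therefore a string):

(
    pets
    .withColumn('age', F.col('age').cast(T.IntegerType()))  # assumed cast; age is a string otherwise
    .fillna('Unknown')           # only fills string columns
    .fillna(-1, subset=['age'])  # numeric columns need a numeric default
    .toPandas()
)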

Option 2 - Fill in All Missing Values with a Mapping

(
    pets
    .fillna({
        'nickname': 'Unknown Nickname',
        'color':    'Unknown Color',
    })
    .toPandas()
)
   id  breed_id          nickname             birthday  age          color
0   1         1              King  2014-11-22 12:30:31    5          brown
1   2         3             Argus  2016-11-22 10:05:10   10  Unknown Color
2   3         1            Chewie  2016-11-22 10:05:10   15  Unknown Color
3   3         2             Maple  2018-11-22 10:05:10   17          white
4   4         2  Unknown Nickname  2019-01-01 10:05:10   13  Unknown Color

What Happened?

You have the option of filling in each column with a different value. This provides more flexibility, as the columns will usually be of different types and a single default value won't be sufficient.
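
Because the mapping is per column, the default values can also be of different types. A minimal sketch (again assuming age is cast to an integer type first):

(
    pets
    .withColumn('age', F.col('age').cast(T.IntegerType()))  # assumed cast
    .fillna({
        'nickname': 'Unknown Nickname',
        'color':    'Unknown Color',
        'age':      -1,              # a numeric default for a numeric column
    })
    .toPandas()
)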

Option 3 - coalesce()

(
    pets
    .withColumn('bogus', F.coalesce(F.col('color'), F.col('nickname'), F.lit('Default')))
    .toPandas()
)
   id  breed_id  nickname             birthday  age  color    bogus
0   1         1      King  2014-11-22 12:30:31    5  brown    brown
1   2         3     Argus  2016-11-22 10:05:10   10   None    Argus
2   3         1    Chewie  2016-11-22 10:05:10   15   None   Chewie
3   3         2     Maple  2018-11-22 10:05:10   17  white    white
4   4         2      None  2019-01-01 10:05:10   13   None  Default

What Happened?

Another way to fill in a column with values is using coalesce(). This function looks at its arguments in order from left to right and returns the first one that is not null. If all else fails, you can provide a "default" value as your last argument (remembering that it should be a columnar expression, hence F.lit()).

In our example, it will attempt to fill in the bogus column with values from the color column first; if that is null it will try the nickname column next, and if both are null it will use the default value Default.
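
coalesce() is also handy when you want to fill in an existing column in place rather than create a new one; a minimal sketch:

(
    pets
    .withColumn('color', F.coalesce(F.col('color'), F.lit('Unknown Color')))  # overwrite color with a default
    .toPandas()
)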

Summary

  • We looked at a generic way of filling in all columns with a single default value.
  • We looked at providing a mapping of {column: value} to fill in each column separately.
  • Lastly, we looked at how to fill in a column using other columns and default values in an order of precedence.
